Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Harsh Kumar, Anuttama Dwivedi, Aayush Singh, Anuj Singh
DOI Link: https://doi.org/10.22214/ijraset.2026.80335
Facial emotion recognition is essential for building smarter human–computer interfaces. This project introduces FeelTrack, a real-time emotion detection system powered by a lightweight custom CNN. Unlike earlier methods built on handcrafted features such as LBP and HOG, which struggled with poor lighting or head tilts, our approach leverages deep learning for better accuracy and adaptability. To reduce dataset bias, the training set mixes FER2013 with an additional collection of Indian facial images: the model was trained on 35,887 FER2013 images and 3,200 custom Indian facial images, evaluated using an 80–20 train–test split, helping it adapt better across different demographics. With preprocessing (grayscale conversion, resizing, normalization), dropout layers, and an optimized architecture, FeelTrack achieves ~80% accuracy while processing frames in under 100 ms on standard hardware, with no GPU required. Tested live via webcam, it reliably detects seven emotions: happy, sad, angry, surprise, fear, disgust, and neutral. Well suited to classroom engagement tracking, mental health monitoring, and interactive systems, FeelTrack shows that practical, accessible emotion AI is within reach using efficient deep learning.
Facial Emotion Recognition (FER) is an important area of affective computing that enables machines to detect human emotions from facial expressions, improving human–computer interaction across domains like healthcare, education, and surveillance. Early FER systems relied on handcrafted techniques such as Local Binary Patterns, which worked well in controlled settings but performed poorly in real-world conditions due to variations in lighting, pose, and background.
With the rise of deep learning, especially Convolutional Neural Networks, FER systems achieved higher accuracy and robustness by automatically learning facial features. Lightweight CNN models further enabled real-time emotion recognition on devices with limited resources. However, challenges such as dataset bias, lack of demographic diversity, privacy concerns, and high computational requirements still persist.
The literature shows continuous improvements through advanced methods like attention-based models, transfer learning, and hybrid approaches, but many high-accuracy systems remain resource-intensive and less suitable for real-time use.
To address these issues, the proposed system “FeelTrack” introduces a lightweight CNN-based FER model that balances accuracy, speed, and efficiency. It incorporates diverse datasets, including Indian facial data, to reduce bias and improve generalization. The system achieves around 80% accuracy with low latency (under 100 ms) and supports real-time emotion detection using a webcam.
Overall, the study highlights the evolution of FER from traditional methods to modern deep learning approaches and emphasizes the need for efficient, fair, and deployable systems for real-world applications.
I. INTRODUCTION

Facial Emotion Recognition (FER) has become a significant component of modern affective computing systems, enabling machines to interpret human emotional states through visual facial cues. The concept of affective computing was first introduced by Picard [29], [30], who emphasized the importance of incorporating emotional intelligence into computational systems to improve human–computer interaction. Since then, FER research has expanded rapidly across multiple domains, including healthcare, education, human–robot interaction, surveillance, and online communication [5], [9], [20], [38]. Facial expressions provide a universal and intuitive medium for conveying emotional states across cultures. Automating the recognition of facial expressions requires the extraction of discriminative patterns related to facial motion, geometry, and texture, which are subsequently mapped to emotion categories such as happiness, sadness, anger, fear, disgust, surprise, and neutrality [27], [40].

Early FER systems relied heavily on handcrafted features such as Local Binary Patterns (LBP) and Gabor filters [40]. While effective in controlled environments, these approaches demonstrated limited robustness in real-world scenarios involving variations in illumination, head pose, occlusion, and demographic diversity [18].

The emergence of deep learning, particularly Convolutional Neural Networks (CNNs), significantly transformed FER methodologies. CNN-based models automatically learn hierarchical feature representations from raw image data, resulting in improved accuracy and robustness compared to traditional approaches [17], [19]. Advances in lightweight CNN architectures further enabled real-time FER on resource-constrained devices [13], [38]. Despite these improvements, challenges related to dataset bias, generalization, privacy preservation, and ethical considerations remain unresolved [25], [30], [36].
This work investigates the evolution of FER systems from handcrafted feature-based methods to modern deep learning approaches. It also proposes a lightweight CNN-based system designed to balance accuracy, computational efficiency, and real-world deployability while addressing dataset diversity concerns.

A. Contributions of This Work
• Design of a lightweight CNN architecture for real-time FER
• Inclusion of an Indian facial expression dataset to reduce demographic bias
• CPU-based inference with latency below 100 ms
• Comparative analysis with computationally intensive FER models

II. LITERATURE SURVEY

Table I. Comparison of Facial Emotion Recognition Models

| Study / Model | Year | Method / Architecture | Dataset(s) Used | Accuracy (%) | Key Highlights |
|---|---|---|---|---|---|
| Zhao & Pietikäinen [40] | 2007 | LBP + Dynamic Texture | CK+ | 84.5 | Early handcrafted method; strong on static images but weak in real-world conditions. |
| Liu et al. [21] | 2016 | CNN Ensemble | FER2013 | 89.2 | Demonstrated CNNs outperforming traditional classifiers. |
| Akhand et al. [1] | 2021 | Transfer Learning + Deep CNN | FER2013 | 93.5 | Used pre-trained models to reduce training time and improve accuracy. |
| Minaee et al. [25] | 2021 | Deep-Emotion (Attention-based CNN) | AffectNet | 94.1 | Focused attention on key facial regions for better precision. |
| Li et al. [20] | 2022 | Lightweight CNN | FER2013 | 94.6 | Optimized for real-time performance on edge devices. |
| Han et al. [13] | 2020 | GhostNet (Depthwise Separable Conv) | FER2013 | 92.8 | Reduced model size and computation using ghost modules. |
| Mishra et al. [26] | 2023 | EfficientNet + XGBoost (Hybrid) | FER2013 | 95.3 | Combined deep features with ML for high accuracy and interpretability. |
| Kong et al. [17] | 2021 | Attention + Key Region Fusion CNN | FER2013 | 93.1 | Improved focus on emotionally salient areas. |
| Xu et al. [38] | 2022 | Residual Separable CNN | FER2013 | 91.7 | Enhanced feature extraction with residual connections. |
| John et al. [16] | 2020 | MobileNetV3-based Real-time System | FER2013 | 90.4 | Achieved real-time inference with low latency on mobile CPUs. |

Facial Emotion Recognition (FER) is a fundamental research topic in affective computing and human–computer interaction. The conceptual foundation of affective computing was introduced by Picard [29], [30], who emphasized that computational systems should be capable of perceiving and responding to human emotions. Early FER approaches primarily relied on handcrafted feature extraction techniques to represent facial appearance and motion. Zhao and Pietikäinen [40] employed Local Binary Patterns (LBP) with dynamic texture analysis for recognizing facial expressions in image sequences. While effective in controlled environments, such handcrafted methods exhibited limited robustness to variations in illumination, head pose, occlusion, and background complexity.

The availability of benchmark datasets significantly influenced the progress of FER research. Lucey et al. [23] introduced the Extended Cohn–Kanade (CK+) dataset, which became a widely used benchmark for evaluating FER systems under laboratory conditions. To address real-world variability, Mollahosseini et al. [27] proposed AffectNet, a large-scale dataset containing facial expressions collected in unconstrained environments. However, several studies reported that existing datasets suffer from class imbalance and limited demographic diversity, which negatively affect model generalization across different populations and cultural contexts [8], [18], [37].

With advances in deep learning, Convolutional Neural Networks (CNNs) became the dominant approach for FER. Liu et al. [21] demonstrated that CNN ensemble models outperform traditional machine learning classifiers by automatically learning hierarchical facial representations. Zhou et al. [39] further validated the effectiveness of lightweight CNN architectures for real-time emotion recognition. Transfer learning techniques were subsequently adopted to improve recognition performance and reduce training cost.
Akhand et al. [1] utilized pre-trained CNN models to enhance accuracy on the FER2013 dataset while achieving faster convergence compared to training from scratch. Several comparative studies confirmed that deep learning-based FER systems consistently outperform conventional classifiers such as Support Vector Machines and Random Forests in terms of accuracy and robustness [4], [15].

To improve discriminative performance, attention-based CNN architectures were introduced. Minaee et al. [25] proposed an attentional convolutional network that selectively focuses on emotionally salient facial regions, resulting in improved recognition accuracy. Nevertheless, attention-based and deep architectures often incur increased computational complexity, limiting their applicability in real-time and resource-constrained environments.

To address efficiency constraints, researchers proposed lightweight CNN architectures optimized for real-time deployment. Ding et al. [7] introduced MobileFaceNet, a compact deep learning model designed for efficient facial analysis on mobile devices. Han et al. [13] proposed GhostNet, which reduces computational cost by generating feature maps using inexpensive operations. John et al. [16] developed a MobileNetV3-based FER system that achieved low inference latency on mobile CPUs. Although these lightweight models provide a favorable trade-off between accuracy and efficiency, most evaluations were conducted on benchmark datasets with limited consideration of demographic diversity.

Recent studies explored advanced and hybrid approaches to enhance FER performance. Kong et al. [17] integrated attention mechanisms with key facial region fusion to improve sensitivity to subtle expressions. Mishra et al. [26] combined deep feature extraction with XGBoost classification, achieving high recognition accuracy at the cost of increased computational overhead.
Additionally, multimodal emotion recognition approaches incorporating facial cues with physiological or behavioral signals demonstrated improved robustness under occlusion and adverse lighting conditions [28], [35].

Despite significant progress, several open challenges remain. Dataset bias and the underrepresentation of non-Western populations continue to limit fairness and generalizability in FER systems [30], [32]. Ethical concerns related to privacy, consent, and cultural misinterpretation of emotions have also been emphasized [24]. To mitigate privacy risks, Verma et al. [36] investigated privacy-preserving techniques such as secure model architectures that reduce exposure of sensitive facial data.

In summary, existing FER research demonstrates substantial improvements through deep learning, attention mechanisms, and lightweight architectures. However, limitations persist in computational efficiency, dataset diversity, and real-world deployability. These challenges motivate the development of FER systems that balance accuracy, latency, and fairness, particularly for deployment on low-resource devices and culturally diverse populations.

III. EXISTING SYSTEMS

Existing Facial Emotion Recognition systems have evolved significantly over the past decade. Early approaches primarily employed handcrafted feature extraction techniques such as Local Binary Patterns (LBP) and Gabor filters to capture facial texture and motion information [40]. These methods required minimal computational resources but were highly sensitive to environmental variations, including illumination changes, head pose differences, and background clutter.

The introduction of Convolutional Neural Networks (CNNs) marked a major advancement in FER research. CNN-based systems automatically learn discriminative features directly from facial images, leading to improved robustness and generalization. To enable real-time performance, several lightweight CNN architectures were proposed.
Models such as MobileNet-based architectures [38] and GhostNet [13] significantly reduced computational complexity while maintaining competitive accuracy, making them suitable for deployment on mobile and edge devices.

More recent systems focused on improving efficiency and accuracy through optimized architectures and attention mechanisms. MobileFaceNet [7] demonstrated efficient facial analysis on mobile hardware, while attention-based models such as Deep-Emotion [25] selectively focused on emotionally salient facial regions to enhance recognition performance. However, many high-accuracy models remain computationally expensive and require GPU acceleration, limiting their applicability in low-resource environments.

Table II. Comparison of existing FER systems by accuracy, latency, and model size

| Model / Study | Year | Architecture / Method | Dataset | Accuracy (%) | Inference Time (ms) | Model Size (MB) | Key Notes |
|---|---|---|---|---|---|---|---|
| Zhao & Pietikäinen [40] | 2007 | LBP + Dynamic Texture | CK+ | 84.5 | 45 | 1.2 | Handcrafted; lightweight but fragile in real-world conditions. |
| Liu et al. [21] | 2016 | CNN Ensemble | FER2013 | 89.2 | 38 | 45 | Early deep learning; high resource use. |
| John et al. [16] | 2020 | MobileNetV3-based | FER2013 | 90.4 | 25 | 12 | Real-time on mobile CPU. |
| Akhand et al. [1] | 2021 | Transfer Learning + Deep CNN | FER2013 | 93.5 | 30 | 38 | Pre-trained for faster convergence. |
| Minaee et al. [25] | 2021 | Deep-Emotion (Attention CNN) | AffectNet | 94.1 | 35 | 52 | Focuses on key facial regions. |
| Li et al. [20] | 2022 | Lightweight CNN | FER2013 | 94.6 | 22 | 8.5 | Optimized for edge devices. |
| Han et al. [13] | 2020 | GhostNet | FER2013 | 92.8 | 18 | 6.2 | Fastest & smallest; ideal for mobile. |
| Xu et al. [38] | 2022 | Residual Separable CNN | FER2013 | 91.7 | 28 | 15 | Efficient feature extraction. |
| Mishra et al. [26] | 2023 | EfficientNet + XGBoost | FER2013 | 95.3 | 40 | 68 | Highest accuracy; heavy model. |

Fig. 1. Accuracy comparison of FER models from literature.

Fig. 1 compares the recognition accuracy of various FER models evaluated on standard datasets. High-capacity architectures, including attention-based and hybrid deep learning models, achieve the highest accuracy.
However, these gains are typically associated with increased computational complexity and hardware requirements. Lightweight CNN-based models demonstrate slightly lower accuracy but remain competitive while offering improved efficiency.

Fig. 2. Inference time and model size comparison of FER models.

Fig. 2 presents a comparison of inference latency and model size across different FER architectures. Lightweight models such as GhostNet and MobileNet-based systems achieve substantially lower inference times and reduced model sizes compared to deeper architectures. Hybrid models, while achieving higher accuracy, exhibit increased latency and memory consumption, making them less suitable for real-time deployment on resource-constrained devices.

IV. PROPOSED SYSTEM

The proposed system, referred to as FeelTrack, is a real-time facial emotion recognition framework based on a lightweight Convolutional Neural Network (CNN). The primary objective of the system is to achieve an optimal balance between recognition accuracy and computational efficiency, enabling deployment on standard consumer hardware without GPU acceleration.

The CNN architecture consists of convolutional layers for feature extraction, pooling layers for dimensionality reduction, dropout layers for regularization, and fully connected layers for emotion classification. This design minimizes the number of trainable parameters, resulting in reduced inference latency and improved real-time performance.

The model was trained using the FER2013 dataset in combination with a custom dataset containing Indian facial images to improve demographic generalization. All images were preprocessed through grayscale conversion, resizing to 48×48 pixels, and normalization. The trained model achieved approximately 80% accuracy on the test set while maintaining inference latency below 100 ms per frame. The system supports real-time emotion recognition through webcam input and is capable of identifying seven emotional states.
Its lightweight design makes it suitable for applications such as classroom engagement analysis, mental health monitoring, and patient observation in healthcare environments.
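The architecture described above (convolution for feature extraction, pooling for dimensionality reduction, dropout for regularization, fully connected layers for 7-way classification, on 48×48 grayscale input) can be sketched in Keras as follows. The specific filter counts, layer depths, and dropout rates are illustrative assumptions, not the authors' exact configuration, which the paper does not enumerate:

```python
# Illustrative sketch of a lightweight FER CNN of the kind described in the
# text. Layer widths and dropout rates are assumptions for demonstration.
from tensorflow.keras import layers, models


def build_feeltrack_cnn(num_classes: int = 7) -> models.Model:
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),                   # 48x48 grayscale
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),                            # 24x24
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),                            # 12x12
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),                            # 6x6
        layers.Dropout(0.25),                              # regularization
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),   # 7 emotions
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A model of this shape has on the order of a few hundred thousand parameters, which is consistent with the paper's goal of sub-100 ms CPU inference, though the exact parameter count of FeelTrack is not stated.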
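The preprocessing pipeline (grayscale conversion, resizing to 48×48, normalization) can be sketched with plain NumPy. This uses standard luminance weights and nearest-neighbour resizing for illustration; a real pipeline would more likely use OpenCV's `cvtColor` and `resize`, and the weights assume RGB channel order:

```python
# Minimal sketch of the preprocessing described in the text: grayscale,
# resize to 48x48, normalize to [0, 1]. Pure NumPy stands in for OpenCV.
import numpy as np


def preprocess_face(frame: np.ndarray, size: int = 48) -> np.ndarray:
    """Turn an HxWx3 uint8 frame into a 1 x size x size x 1 float32 tensor."""
    # Grayscale via standard luminance weights (assumes RGB channel order).
    gray = frame[..., :3] @ np.array([0.299, 0.587, 0.114])
    # Nearest-neighbour resize to size x size.
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = gray[rows][:, cols]
    # Normalize pixel values to [0, 1]; clip guards against rounding overflow.
    norm = np.clip(resized / 255.0, 0.0, 1.0).astype(np.float32)
    # Add batch and channel axes expected by a CNN.
    return norm[np.newaxis, ..., np.newaxis]
```

The output shape (1, 48, 48, 1) matches the single-channel input the lightweight CNN expects.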
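Real-time webcam operation could be driven by a loop of the following shape. This is a hypothetical sketch: the Haar-cascade face detector and the emotion label ordering (the standard FER2013 class order) are assumptions, since the paper does not specify which detector FeelTrack uses:

```python
# Hypothetical real-time inference loop for a trained FER model. The face
# detector and label ordering are illustrative assumptions.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def top_emotion(probabilities) -> str:
    """Map a length-7 probability vector to its emotion label."""
    best = max(range(len(EMOTIONS)), key=lambda i: probabilities[i])
    return EMOTIONS[best]


def run_webcam_demo(model, preprocess):
    """Capture frames, detect faces, and classify each one ('q' quits)."""
    import cv2  # imported lazily so the helpers above stay dependency-free
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(0)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):
            # Classify the cropped face region and annotate the frame.
            probs = model.predict(preprocess(frame[y:y + h, x:x + w]))[0]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, top_emotion(probs), (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
        cv2.imshow("FeelTrack demo", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```

Keeping detection, preprocessing, and classification inside one per-frame loop is what makes the sub-100 ms latency budget meaningful: every stage must fit within it.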
REFERENCES

[1] Akhand, M. A. H., et al. (2021). "Facial emotion recognition using transfer learning in the deep CNN." Electronics, 10(9), 1036. https://doi.org/10.3390/electronics10091036
[2] Barros, P., et al. (2021). "The FaceChannel: A fast and light-weight deep neural network for facial expression recognition." SN Computer Science, 2(5), 403. https://doi.org/10.1007/s42979-021-00797-x
[3] Bisogni, C., et al. (2022). "A deep learning-based approach for intelligent healthcare emotion analysis." IEEE Transactions on Industrial Informatics, 18(8), 5557–5564. https://doi.org/10.1109/TII.2021.3133816
[4] Caroppo, A., Leone, A., & Siciliano, P. (2020). "Comparison between deep learning models and machine learning for facial expression recognition." In Proceedings of the 2020 International Conference on Image and Vision Computing (ICIVC) (pp. 108–113). https://doi.org/10.1145/3409948.3409966
[5] Chen, B., & Li, Q. (2021). "Teleconsultation demand prediction based on a hybrid CNN-LSTM model." Journal of Healthcare Engineering, 2021, 6694851. https://doi.org/10.1155/2021/6694851
[6] De Sario, G. D., et al. (2023). "Using AI to detect pain through facial expressions: A review." Bioengineering, 10(5), 548. https://doi.org/10.3390/bioengineering10050548
[7] Ding, H., et al. (2021). "MobileFaceNet: A lightweight deep learning model for face recognition on mobile devices." IEEE Transactions on Multimedia, 23, 2578–2590. https://doi.org/10.1109/TMM.2020.3037526
[8] Egede, J., et al. (2021). "Deep learning for pain analysis: A survey." IEEE Transactions on Affective Computing, 12(3), 793–814. https://doi.org/10.1109/TAFFC.2019.2912752
[9] Gaya-Morey, F. X., et al. (2025). "Evaluating facial expression recognition datasets for deep learning: A benchmark study with novel similarity metrics." arXiv preprint, arXiv:2503.20428. https://doi.org/10.48550/arXiv.2503.20428
[10] Goodfellow, I. J., et al. (2013). "Challenges in representation learning: A report on three machine learning contests." In Proceedings of the 20th International Conference on Neural Information Processing (pp. 117–124).
[11] Guerdelli, H., et al. (2022). "Macro- and micro-expressions facial datasets: A survey." Sensors, 22(4), 1524. https://doi.org/10.3390/s22041524
[12] Hadjar, H., et al. (2025). "TheraSense: Deep learning for facial emotion analysis in mental health teleconsultation." Electronics, 14(3), 422. https://doi.org/10.3390/electronics14030422
[13] Han, K., et al. (2020). "GhostNet: More features from cheap operations." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1580–1589). https://doi.org/10.1109/CVPR42600.2020.00165
[14] Huo, H., Yu, Y., & Liu, Z. (2022). "Facial expression recognition based on improved depthwise separable convolutional network." Multimedia Tools and Applications, 82(12), 18635–18652. https://doi.org/10.1007/s11042-022-14066-6
[15] Jin, Y. (2024). "Advancements in facial expression recognition: A comparative study of traditional machine learning and deep learning approaches." In Proceedings of the 1st International Conference on Engineering Management, Information Technology and Intelligence (EMITI) (pp. 311–315). https://doi.org/10.5220/0012937300004508
[16] John, A., et al. (2020). "Real-time facial emotion recognition system with improved accuracy." In Proceedings of the 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT) (pp. 729–735). https://doi.org/10.1109/ICSSIT48917.2020.9214150
[17] Kong, Y., et al. (2021). "Lightweight facial expression recognition method based on attention mechanism and key region fusion." Journal of Electronic Imaging, 30(6), 063002. https://doi.org/10.1117/1.JEI.30.6.063002
[18] Krumhuber, E. G., et al. (2017). "A review of dynamic datasets for facial expression research." Emotion Review, 9(3), 280–292. https://doi.org/10.1177/1754073916670022
[19] Li, S., & Deng, W. (2017). "Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2584–2593). https://doi.org/10.1109/CVPR.2017.277
[20] Li, Y., et al. (2022). "A lightweight convolutional neural network for real-time facial expression recognition." IEEE Access, 10, 55679–55690. https://doi.org/10.1109/ACCESS.2022.3176790
[21] Liu, K., Zhang, M., & Pan, Z. (2016). "Facial expression recognition with CNN ensemble." In Proceedings of the 2016 International Conference on Cyberworlds (CW) (pp. 163–166). https://doi.org/10.1109/CW.2016.32
[22] Llurba, C., & Palau, R. (2024). "Real-time emotion recognition for improving the teaching–learning process: A scoping review." Journal of Imaging, 10(12), 313. https://doi.org/10.3390/jimaging10120313
[23] Lucey, P., et al. (2010). "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression." In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 94–101). https://doi.org/10.1109/CVPRW.2010.5543262
[24] Mattioli, M., & Cabitza, F. (2024). "Not in my face: Challenges and ethical considerations in automatic face emotion recognition technology." Machine Learning and Knowledge Extraction, 6(4), 2201–2231. https://doi.org/10.3390/make6040109
[25] Minaee, S., et al. (2021). "Deep-Emotion: Facial expression recognition using attentional convolutional network." Sensors, 21(9), 3046. https://doi.org/10.3390/s21093046
[26] Mishra, A. K., et al. (2023). "EfficientNet-XGBoost: A novel transfer learning-based method for facial emotion recognition." Mathematics, 11(3), 776. https://doi.org/10.3390/math11030776
[27] Mollahosseini, A., et al. (2019). "AffectNet: A database for facial expression, valence, and arousal in the wild." IEEE Transactions on Affective Computing, 10(1), 18–31. https://doi.org/10.1109/TAFFC.2017.2740923
[28] Pandit, M., et al. (2022). "An enhanced deep learning approach for intelligent healthcare emotion analysis using facial expressions and feature analysis to identify pain." Robotica, 43(4), 1512–1532. https://doi.org/10.1017/S0263574724002224
[29] Picard, R. W. (1995). "Affective computing." MIT Media Laboratory, Perceptual Computing Section, Technical Report No. 321.
[30] Picard, R. W. (1997). "Affective Computing." MIT Press. https://doi.org/10.7551/mitpress/1140.001.0001
[31] Pramerdorfer, C., & Kampel, M. (2016). "Facial expression recognition using convolutional neural networks: State of the art." arXiv preprint, arXiv:1612.02903. https://doi.org/10.48550/arXiv.1612.02903
[32] Qi, X., Zhang, C., & Xu, L. (2024). "Survey on deep learning based face expression recognition methods." In Proceedings of the International Conference on Artificial Intelligence and Communication (ICAIC) (pp. 368–375). https://doi.org/10.2991/978-94-6463-512-6_48
[33] Saadon, J. R., et al. (2023). "Real-time emotion detection by quantitative facial motion analysis." PLoS ONE, 18(3), e0282730. https://doi.org/10.1371/journal.pone.0282730
[34] Singh, P. K., & Kaur, R. (2022). "Facial emotion recognition using deep learning detector and classifier." International Journal of Electrical and Computer Engineering, 12(6), 6425–6433. https://doi.org/10.11591/ijece.v12i6.pp6425-6433
[35] Tzirakis, P., et al. (2017). "End-to-end multimodal emotion recognition using deep neural networks." IEEE Journal of Selected Topics in Signal Processing, 11(8), 1301–1309. https://doi.org/10.1109/JSTSP.2017.2764438
[36] Verma, M., et al. (2023). "Efficient neural architecture search for emotion recognition." Expert Systems with Applications, 223, 119934. https://doi.org/10.1016/j.eswa.2023.119934
[37] Wang, S., & Wu, H. (2019). "Facial expression recognition based on CNN." Journal of Physics: Conference Series, 1239(1), 012008. https://doi.org/10.1088/1742-6596/1239/1/012008
[38] Xu, X., et al. (2022). "A facial expression recognition method based on residual separable convolutional neural network." Journal of Network Intelligence, 7(1), 60–71.
[39] Zhou, N., et al. (2021). "A lightweight convolutional neural network for real-time facial expression detection." IEEE Access, 9, 7481–7491. https://doi.org/10.1109/ACCESS.2020.3046715
[40] Zhao, G., & Pietikäinen, M. (2007). "Dynamic texture recognition using local binary patterns with an application to facial expressions." IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 915–928. https://doi.org/10.1109/TPAMI.2007.1110
Copyright © 2026 Harsh Kumar, Anuttama Dwivedi, Aayush Singh, Anuj Singh. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET80335
Publish Date : 2026-04-16
ISSN : 2321-9653
Publisher Name : IJRASET
